Biostatistics For Dummies (Monika Wahi John Pezzullo)

Examine the smallest and largest values in numerical data: Have the software show you the

smallest and largest values for each numerical variable. This check can often catch decimal-point

errors (such as a hemoglobin value of 125 g/dL instead of 12.5 g/dL) or transposition errors (for

example, a weight of 517 pounds instead of 157 pounds).
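
If you prefer to script this check, here is a minimal Python sketch of the idea. The records, the variable names (hgb_g_dl, weight_lb), and the plausible ranges are all made up for illustration; pick limits that make sense for your own study.

```python
# Hypothetical records; names and values are invented for illustration.
records = [
    {"id": 1, "hgb_g_dl": 13.2, "weight_lb": 157},
    {"id": 2, "hgb_g_dl": 125.0, "weight_lb": 142},  # likely decimal-point error
    {"id": 3, "hgb_g_dl": 11.8, "weight_lb": 517},   # likely transposition error
    {"id": 4, "hgb_g_dl": 14.1, "weight_lb": 180},
]

# Assumed plausible ranges (not clinical reference limits).
plausible = {"hgb_g_dl": (3.0, 25.0), "weight_lb": (50, 500)}


def min_max(rows, var):
    """Return the smallest and largest non-missing values of a variable."""
    values = [r[var] for r in rows if r[var] is not None]
    return min(values), max(values)


def out_of_range(rows, var, low, high):
    """Return the IDs of records whose value falls outside [low, high]."""
    return [
        r["id"]
        for r in rows
        if r[var] is not None and not (low <= r[var] <= high)
    ]


for var, (low, high) in plausible.items():
    print(var, min_max(records, var), out_of_range(records, var, low, high))
```

Printing the record IDs along with the extremes tells you which forms to pull and recheck.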

Sort the values of variables: If your program can show you a sorted list of all the values for a

variable, that’s even better — it often shows misclassified categories as well as numerical

outliers.
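
As a rough sketch of why sorting helps, the Python snippet below sorts the distinct values of a made-up category variable and a made-up weight variable. The stray categories and the suspicious weights end up where they are easy to spot.

```python
# Hypothetical values, invented for illustration.
sex = ["F", "M", "f", "M ", "Female", "M", "F"]
weights = [157, 142, 517, 180, 15.7]


def sorted_unique(values):
    """Return each distinct value once, in sorted order."""
    return sorted(set(values))


# repr() makes trailing blanks visible, as in 'M '.
print([repr(v) for v in sorted_unique(sex)])

# Numerical outliers collect at the two ends of the sorted list.
print(sorted(weights))
```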

Search for blanks and commas: You can have Excel search for blanks in category values that

shouldn’t have blanks, or for commas in numeric variables. Make sure the “Match entire cell

contents” option is deselected in the Find and Replace dialog box (you may have to click the

Options button to see the check box). This operation can also be done using statistical software. Be

wary if there are a large number of missing values, because this could indicate a data collection

problem.
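
Outside of Excel, the same search can be sketched in a few lines of Python. This is only an illustration under assumed conditions: the rows and the variable names (sex, income) are hypothetical, and every value is assumed to have been read in as text. The substring test mirrors a search with "Match entire cell contents" deselected.

```python
# Hypothetical rows, with every value read in as text.
rows = [
    {"id": "1", "sex": "M", "income": "52000"},
    {"id": "2", "sex": "F ", "income": "48,500"},
    {"id": "3", "sex": "", "income": ""},
    {"id": "4", "sex": " M", "income": "61000"},
]


def ids_containing(rows, var, text):
    """Return IDs where `text` appears anywhere in the value."""
    return [r["id"] for r in rows if text in r[var]]


def missing_fraction(rows, var):
    """Return the fraction of rows whose value is empty or only blanks."""
    n_missing = sum(1 for r in rows if r[var].strip() == "")
    return n_missing / len(rows)


print("Blanks in sex:", ids_containing(rows, "sex", " "))
print("Commas in income:", ids_containing(rows, "income", ","))
print("Fraction missing, sex:", missing_fraction(rows, "sex"))
```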

Tabulate categorical variables: You can have your statistics program tabulate each categorical

variable (showing you how often each category occurred in your data). This check

usually finds misclassified categories. Note that blanks and special characters in character

variables may cause incorrect results when querying, which is why it is important to do this check.

Spot-check data entry: If you are entering data from forms or printed material, choose a percentage

to double-check (for example, 10 percent of the forms you entered). This can help you tell if there

are any systematic data entry errors or missing data.
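
To avoid unconsciously picking the forms that are easiest to check, you can let the computer choose them at random. The sketch below assumes 200 forms numbered 1 through 200. The function name, the default of 10 percent, and the seed value are illustrative choices, not a prescribed procedure.

```python
import random


def spot_check_sample(form_ids, percent=10, seed=2024):
    """Randomly pick about `percent` percent of the forms to re-check.

    A fixed seed makes the selection reproducible, so it can be documented.
    """
    rng = random.Random(seed)
    # Round up with integer arithmetic, and always check at least one form.
    n = max(1, (len(form_ids) * percent + 99) // 100)
    return sorted(rng.sample(form_ids, n))


form_ids = list(range(1, 201))  # hypothetical form numbers 1 to 200
to_check = spot_check_sample(form_ids)
print(len(to_check), "forms to double-check:", to_check)
```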

Creating a File that Describes Your Data File

Every research database, large or small, simple or complicated, should include a data dictionary that

describes the variables contained in the database. It is a necessary part of study documentation that

needs to be accessible to the research team. A data dictionary is usually set up as a table (often in

Excel), where each row documents one variable in the database. For each variable,

the dictionary should contain the following information (sometimes referred to as metadata, which

means “data about data”):

A variable name (usually no more than ten characters) that’s used when telling the software what

variables you want it to use in an analysis

A longer verbal description of the variable in a human-readable format (in other words, a person

reading this description should be able to understand the content of the variable)

The type of data (text, categorical, numerical, date/time, and so on)

If numeric: Information about how that number is displayed (how many digits are before

and after the decimal point)

If date/time: How it’s formatted (for example, 12/25/13 10:50pm or 25Dec2013 22:50)

If categorical: What codes and descriptors exist for each level of the category (these are

often called picklists, and can be documented on a separate tab in an Excel data dictionary)

How missing values are represented in the database (99, 999, “NA,” and so on)
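
To make the items above concrete, here is a small sketch of a data dictionary built in Python and written out as comma-separated text that Excel can open. The variable names, descriptions, codes, and missing-value codes are all hypothetical, and the column headings are just one reasonable layout.

```python
import csv
import io

# Hypothetical data dictionary: one entry (row) per variable.
data_dictionary = [
    {"name": "subj_id", "description": "Unique subject identifier",
     "type": "text", "format": "", "missing": ""},
    {"name": "hgb", "description": "Hemoglobin at baseline (g/dL)",
     "type": "numerical",
     "format": "2 digits before, 1 after the decimal point", "missing": "99"},
    {"name": "visit_dt", "description": "Date and time of baseline visit",
     "type": "date/time", "format": "25Dec2013 22:50", "missing": "NA"},
    {"name": "smoke", "description": "Smoking status",
     "type": "categorical", "format": "1=never; 2=former; 3=current",
     "missing": "9"},
]

fields = ["name", "description", "type", "format", "missing"]

# Write to an in-memory buffer; swap in open("dictionary.csv", "w", newline="")
# to create a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fields)
writer.writeheader()
writer.writerows(data_dictionary)
print(buffer.getvalue())

# Flag any variable name longer than ten characters.
too_long = [v["name"] for v in data_dictionary if len(v["name"]) > 10]
print("Names over ten characters:", too_long)
```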